docs: EP-1270 Authorization (access control) design proposal by davidkarlsen · Pull Request #2075 · kagent-dev/kagent

davidkarlsen · 2026-06-23T10:57:36Z

Summary

Adds an Enhancement Proposal for authorization (access control) in KAgent — issue #1270.

Today the controller ships with NoopAuthorizer, so once a user is authenticated they can list, invoke, edit and delete every Agent, ModelConfig and ToolServer across every namespace. Enabling OIDC (#1293) gives authentication but no access control. This EP proposes the fine-grained authorization that EP-476 explicitly deferred.

Approach

The earlier #1270 discussion stalled on a design tension: an opinionated in-process RBAC engine vs. a pluggable extension point. The EP proposes CEL as the resolution — it's both:

- In-process default, no new SPOF (cel-go is already in our module graph), and
- Not a hard-coded RBAC model: policy is an expression over claims/verb/resource, so groups are one option among many and the project isn't married to one engine.

The auth.Authorizer interface stays the seam, so an external/OPA authorizer (#1370) remains pluggable. Per-resource policy lives on the Agent CR, compiled via reconciliation (cached, validated onto status.conditions), enforced centrally. Builds on the stalled prototypes in #1766 (per-agent annotation + list filtering + A2A gating) and #1370 (external authorizer interface) rather than starting over.

Design comment that led here: #1270 (comment)

Status

provisional — following the "merge early and iterate" guidance in the EP template. High-level direction is the goal; details (per-resource carrier, policy-combining semantics, default-deny behavior) are flagged as Open Questions / UNRESOLVED for discussion.

Looking for a maintainer sponsor and a directional 👍 on "CEL as the default, behind the existing interface."

/cc @EItanya @peterj

🤖 Generated with Claude Code

davidkarlsen · 2026-06-23T10:59:13Z

@EItanya @peterj PTAL

Copilot

Pull request overview

Adds a new Enhancement Proposal (EP-1270) documenting a design for introducing fine-grained authorization (access control) in KAgent, centered on CEL-based policy evaluation while preserving the existing auth.Authorizer seam for pluggable implementations.

Changes:

- Introduces EP-1270 documenting current authorization gaps and the proposed CEL-based default authorizer.
- Specifies a policy model, decision context, and rollout strategy (opt-in, fail-closed, cached compilation).
- Outlines operational considerations (list filtering, A2A gating) and an initial test plan.

Proposes a real Authorizer to replace the open-by-default NoopAuthorizer: CEL-based, in-process, behind the existing auth.Authorizer interface, with per-resource policy on the Agent CR compiled via reconciliation and a default-deny model. Builds on the stalled prototypes in kagent-dev#1766 and kagent-dev#1370.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: David J. M. Karlsen <david@davidkarlsen.com>

Address PR review: ProxyAuthenticator only populates Principal.Claims for direct user calls; the agent-call path (X-Agent-Name) sets User/Agent but not Claims. Qualify the Background statement and strengthen Open Question #5: a claims-only fail-closed policy would deny internal agent/M2M traffic, so the model needs an agent-identity match or a separate M2M lane.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: David J. M. Karlsen <david@davidkarlsen.com>

davidkarlsen · 2026-06-25T18:52:31Z

@dimetron PTAL?

dimetron · 2026-06-26T19:48:29Z

The proposal picks the right engine (CEL) and reuses the existing
auth.Authorizer.

What's good (approve)

- CEL is the right engine. It runs in-process, it is sandboxed and guaranteed to
  terminate, it evaluates in microseconds, and cel-go is already in the module
  graph, so it adds no new single point of failure.
- It keeps auth.Authorizer as the extension seam, so the external OPA option
  (#1370) stays a drop-in.
- Compiling policy at reconcile time into a generation-keyed cache, with a lazy
  fallback on cache miss, is the right pattern for a controller.
- The default path is fail-closed, opt-in, and backward compatible, and a bad
  policy shows up on status.conditions as AccessPolicyValid.
- It builds on the stalled prototypes (#1766 for list filtering and the A2A
  gate, #1370 for the external authorizer) instead of starting over.
- The non-goals and alternatives are honest. Casbin, OPA-only, a bespoke DSL,
  and SubjectAccessReview each get dismissed with a real reason.
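
A minimal sketch of the generation-keyed cache with lazy fallback praised above. Everything here is illustrative: `Program`, `CompileFunc`, `fakeCompile`, and `PolicyCache` are hypothetical names, and the compile step is a stand-in for whatever cel-go compilation the EP ends up using. The sketch also drops older generations of the same object on a successful recompile, which is an assumption and not something the EP states.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// Program stands in for a compiled policy (hypothetical placeholder type).
type Program struct{ Source string }

// CompileFunc is a hypothetical compile step supplied by the caller.
type CompileFunc func(expr string) (*Program, error)

// fakeCompile is a placeholder compiler that only rejects empty expressions.
func fakeCompile(expr string) (*Program, error) {
	if strings.TrimSpace(expr) == "" {
		return nil, errors.New("empty policy expression")
	}
	return &Program{Source: expr}, nil
}

type cacheKey struct {
	name       string // "<namespace>/<name>" of the Agent
	generation int64  // metadata.generation the policy was compiled from
}

// PolicyCache stores compiled policies keyed by object and generation.
type PolicyCache struct {
	mu      sync.Mutex
	compile CompileFunc
	entries map[cacheKey]*Program
}

func NewPolicyCache(compile CompileFunc) *PolicyCache {
	return &PolicyCache{compile: compile, entries: map[cacheKey]*Program{}}
}

// compileAndStore compiles expr and, on success, replaces older generations
// of the same object. On failure the cache is left untouched.
func (c *PolicyCache) compileAndStore(name string, gen int64, expr string) (*Program, error) {
	prog, err := c.compile(expr)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.entries {
		if k.name == name && k.generation != gen {
			delete(c.entries, k)
		}
	}
	c.entries[cacheKey{name, gen}] = prog
	return prog, nil
}

// Reconcile is the reconcile-time path; the caller would surface a returned
// error on status.conditions.
func (c *PolicyCache) Reconcile(name string, gen int64, expr string) error {
	_, err := c.compileAndStore(name, gen, expr)
	return err
}

// Get is the request-time path: cache hit if possible, lazy compile on a miss.
func (c *PolicyCache) Get(name string, gen int64, expr string) (*Program, error) {
	c.mu.Lock()
	prog, ok := c.entries[cacheKey{name, gen}]
	c.mu.Unlock()
	if ok {
		return prog, nil
	}
	return c.compileAndStore(name, gen, expr)
}

func main() {
	cache := NewPolicyCache(fakeCompile)
	_ = cache.Reconcile("kagent/helper", 1, `"admins" in claims.groups`)
	// Generation 2 was never reconciled, so this call compiles lazily.
	prog, err := cache.Get("kagent/helper", 2, `"devs" in claims.groups`)
	fmt.Println(prog.Source, err)
}
```

Keying on generation means a spec edit can never be served from a stale compiled program: the new generation simply misses and recompiles.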

What is missing (must address before `accepted`)

1. Coverage is overstated. The EP says authz is "wired into every handler
   (~25 call sites)", but only about 8 of the ~22 handler areas gate anything
   today. Sessions, memory, tasks, model-provider config, companion secrets, and
   A2A invoke are all ungated, and those are the most sensitive surfaces. The EP
   needs an honest coverage matrix (below) and a commitment to close them.
2. M2M and agent calls carry no claims, so a fail-closed claims-based policy
   would deny all internal A2A traffic. That is a blocker, not an open question.
   The cleanest fix is workload identity (see below), which also removes the
   spoofable X-Agent-Name header the agent path trusts today.
3. Policy combining is left as UNRESOLVED, but it is core semantics and nothing
   can be implemented without it. Pin it down: default-deny, allow if either the
   central policy or the matching per-resource policy permits, and only consult
   a per-resource policy for its own resource. That last rule is what
   structurally guarantees the widen-only invariant, so the EP should say so.
4. There is no invoke verb. A2A invocation collapses onto get, so a policy
   cannot tell "may read" apart from "may run". Add VerbInvoke.
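
The combining rule and the extra verb can be sketched as follows. This is only an illustration of the semantics described above, not the EP's API: `Request`, `Policy`, `ResourcePolicy`, `Decide`, and `inGroup` are hypothetical names, and plain Go predicates stand in for compiled CEL expressions.

```go
package main

import "fmt"

type Verb string

const (
	VerbGet    Verb = "get"
	VerbCreate Verb = "create"
	VerbUpdate Verb = "update"
	VerbDelete Verb = "delete"
	VerbInvoke Verb = "invoke" // proposed in this review; not in the Verb set today
)

// Resource carries Namespace explicitly (also an assumption of this sketch).
type Resource struct{ Type, Namespace, Name string }

type Request struct {
	Groups   []string
	Verb     Verb
	Resource Resource
}

// Policy stands in for a compiled policy expression.
type Policy func(Request) bool

// ResourcePolicy is a policy attached to exactly one resource.
type ResourcePolicy struct {
	Owner Resource
	Allow Policy
}

// Decide is default-deny: it allows only if the central policy permits, or
// if a per-resource policy owned by the requested resource permits.
// Policies owned by other resources are never consulted, so a per-resource
// policy can widen access to its own resource and nothing else.
func Decide(req Request, central Policy, perResource []ResourcePolicy) bool {
	if central != nil && central(req) {
		return true
	}
	for _, p := range perResource {
		if p.Owner != req.Resource {
			continue
		}
		if p.Allow != nil && p.Allow(req) {
			return true
		}
	}
	return false
}

// inGroup builds a policy that allows the listed verbs to members of group.
func inGroup(group string, verbs ...Verb) Policy {
	return func(req Request) bool {
		member := false
		for _, g := range req.Groups {
			if g == group {
				member = true
			}
		}
		if !member {
			return false
		}
		for _, v := range verbs {
			if v == req.Verb {
				return true
			}
		}
		return false
	}
}

func main() {
	helper := Resource{"Agent", "kagent", "helper"}
	other := Resource{"Agent", "kagent", "other"}
	central := inGroup("platform-admins", VerbGet, VerbUpdate, VerbDelete, VerbInvoke)
	perResource := []ResourcePolicy{{Owner: helper, Allow: inGroup("devs", VerbGet)}}
	dev := []string{"devs"}

	fmt.Println(Decide(Request{dev, VerbGet, helper}, central, perResource))    // true: may read
	fmt.Println(Decide(Request{dev, VerbInvoke, helper}, central, perResource)) // false: may not run
	fmt.Println(Decide(Request{dev, VerbGet, other}, central, perResource))     // false: policy not consulted
}
```

With a distinct invoke verb, the second call shows a principal who may read an agent but not run it, which is the distinction a get-only verb set cannot express.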

M2M: use workload identity, keep one policy model

Today the agent path identifies the caller from the unverified X-Agent-Name
header and sets no claims. Instead, let the M2M caller present a verified
identity token and bind it into the principal. Any of these works, in order of
how native they are to the stack:

1. A Kubernetes projected ServiceAccount token, with sub
   system:serviceaccount:<ns>:<sa>, audience-bound to kagent.
2. A SPIFFE JWT-SVID, with sub spiffe://<trust-domain>/ns/<ns>/sa/<sa>.
3. Istio's default mTLS (X.509-SVID), where the sidecar forwards the peer SPIFFE
   ID in the X-Forwarded-Client-Cert header.

The authorizer then populates Principal.Claims and Principal.Agent.ID from
that verified identity, so the same CEL model covers humans and machines with no
separate lane:

// SPIFFE JWT-SVID
claims.sub == "spiffe://kagent.local/ns/kagent/sa/agent-runner"

// or a projected Kubernetes ServiceAccount token
claims.sub == "system:serviceaccount:kagent:agent-runner"

This resolves the M2M blocker and the spoofable-header risk at once. Keep it
simple in the first cut: pick one carrier (projected SA token is the most
native), verify the audience, parse the identity into structured fields once,
and leave on-behalf-of user delegation (#2071 / STS) for later. Workload
identity tells you which agent called, not on whose behalf.
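
"Parse the identity into structured fields once" could look like the sketch below. It assumes the token has already been verified (signature and audience) by the time the `sub` value reaches this function, and it covers only the two subject shapes named above. `WorkloadIdentity` and `ParseSubject` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// WorkloadIdentity is a hypothetical structured form of a verified subject.
type WorkloadIdentity struct {
	TrustDomain    string // empty for ServiceAccount tokens
	Namespace      string
	ServiceAccount string
}

// ParseSubject accepts either
//   system:serviceaccount:<ns>:<sa>
//   spiffe://<trust-domain>/ns/<ns>/sa/<sa>
// and rejects everything else. It does not verify the token itself.
func ParseSubject(sub string) (WorkloadIdentity, error) {
	if rest, ok := strings.CutPrefix(sub, "system:serviceaccount:"); ok {
		parts := strings.Split(rest, ":")
		if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
			return WorkloadIdentity{}, fmt.Errorf("malformed ServiceAccount subject %q", sub)
		}
		return WorkloadIdentity{Namespace: parts[0], ServiceAccount: parts[1]}, nil
	}
	if rest, ok := strings.CutPrefix(sub, "spiffe://"); ok {
		parts := strings.Split(rest, "/")
		if len(parts) != 5 || parts[1] != "ns" || parts[3] != "sa" ||
			parts[0] == "" || parts[2] == "" || parts[4] == "" {
			return WorkloadIdentity{}, fmt.Errorf("malformed SPIFFE subject %q", sub)
		}
		return WorkloadIdentity{
			TrustDomain:    parts[0],
			Namespace:      parts[2],
			ServiceAccount: parts[4],
		}, nil
	}
	return WorkloadIdentity{}, fmt.Errorf("unsupported subject %q", sub)
}

func main() {
	for _, sub := range []string{
		"system:serviceaccount:kagent:agent-runner",
		"spiffe://kagent.local/ns/kagent/sa/agent-runner",
		"agent-runner", // e.g. a bare header value: rejected
	} {
		id, err := ParseSubject(sub)
		fmt.Printf("%+v err=%v\n", id, err)
	}
}
```

Parsing once into namespace and service account lets policies match on structured fields instead of repeating string comparisons against the raw `sub`.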

Enforcement model: middleware vs per-handler (root cause of gap #1)

The coverage gap is not really "14 handlers forgot a check". It is the
enforcement model. Authz today is opt-in per handler, with Check() scattered
across 8 files, which makes it default-open: any route without a Check() call
is silently a bypass. A one-time audit fixes today's snapshot, but it rots the
moment someone adds a route.

Recommendation: add a deny-by-default authz middleware as the backstop, and keep
per-handler Check() for the cases the middleware cannot cover. A route that is
not explicitly mapped to a (resourceType, verb) is denied.

This fits the code as it stands:

- The server already runs a middleware chain (s.router.Use(...)) with an
  AuthnMiddleware sibling, so an AuthzMiddleware slots in right after it
  (go/core/internal/httpserver/server.go:356-360).
- Routes use gorilla/mux with {namespace} and {name} vars, so the middleware
  can read the resource name and namespace from mux.Vars(r) and the verb from
  the HTTP method (the same switch handlers.Check already uses).

What stays in the handler, because the middleware cannot do it:

| Concern | Where | Why |
| --- | --- | --- |
| Coarse gate, "may you touch this resource type and verb at all" | Middleware (deny-by-default) | One chokepoint; closes the gap structurally |
| List filtering, per returned item | Handler | Needs the response set, not just the request |
| Create where name and namespace are in the body | Handler | Middleware sees path vars, not the decoded body |
| Per-resource policy combining (Agent `spec.accessPolicy`) | Handler | Needs the fetched resource plus central-vs-resource combining |
| Non-uniform routes (A2A `/{ns}/{name}`, sessions and memory keyed by agent) | Both, with an explicit route entry | Resource identity is not inferable from the path shape alone |

This needs two pieces: a declarative registry that maps each route to its
resource type, verb, and whether it is public, and an explicit public allowlist
(/health, /version, and the self-scoped /api/user) so probes and
self-calls keep working.

Why it is worth the refactor: a missing registry entry fails closed (the request
is denied), whereas a missing Check() today fails open (the request goes
through). That asymmetry is the whole reason to do it. The same middleware also
covers the A2A PathPrefix handler (server.go:347) instead of leaving it to a
separate hand-wired gate. The EP should state this enforcement-model choice
explicitly rather than inheriting the implicit per-handler one.

Smaller suggestions

- Keep authorizer set to noop whenever auth.mode=unsecure. Dev clusters
  have no claims, so a claims-based policy would lock them out.
- Add Namespace to auth.Resource, or normalize it once inside the
  authorizer, so handlers stop re-parsing namespace/name.
- Add an observability section: decision metrics
  (kagent_authz_decisions_total{result,resource_type},
  kagent_authz_config_valid) and deny logging at V(1) that never prints
  claim values.
- Add a threat-model and trust-boundary paragraph. The proxy validates the JWT
  and the controller trusts the proxy. List what authz does not cover (direct
  pod access in #2028, and the ungated endpoints).
- Add break-glass and bootstrap-admin guidance so turning authz on against a
  live cluster does not lock everyone out.
- Note that the Casbin sections in EP-476 are superseded by EP-1270.
- Note the size limits: a CEL string in spec.accessPolicy counts against the
  etcd object limit, the central ConfigMap is capped at 1 MiB, and the
  compiled-program cache currently only evicts on delete.

Authorization gates matrix

Verified against main by counting Check() and authorizeAgentRequest calls
per handler file in go/core/internal/httpserver/handlers.

| Handler | Routes (examples) | Gates today | Sensitivity | Risk if CEL enabled |
| --- | --- | --- | --- | --- |
| `agents.go` | `/api/agents/*` | yes, ~12 (incl. `authorizeAgentRequest`) | high | covered |
| `modelconfig.go` | `/api/modelconfigs/*` | yes, 5 | high (cred refs) | covered |
| `prompttemplates.go` | `/api/prompttemplates/*` | yes, 5 | medium | covered |
| `toolservers.go` | `/api/toolservers/*` | yes, 4 | medium | covered |
| `toolservertypes.go` | `/api/toolservertypes/*` | yes, 1 | low | covered |
| `mcpapps.go` | `/api/mcpapps/*` | yes, 1 | low | covered |
| `substrate.go` | `/api/substrate/*` | yes, 1 | medium | covered |
| `sessions.go` | `/api/sessions/*` | none | high (conversation content) | bypass |
| `memory.go` | `/api/memories/*` | none | high (embeddings, PII) | bypass |
| `tasks.go` | `/api/tasks/*` | none | high (task data) | bypass |
| `checkpoints.go` | LangGraph checkpoints | none | high (state) | bypass |
| `modelproviderconfig.go` | `/api/modelproviderconfigs/*` | none | high (credential-adjacent) | bypass |
| `models.go` | `/api/models` | none | medium | bypass |
| `namespaces.go` | `/api/namespaces` | none | medium (enumeration) | bypass |
| `tools.go` | `/api/tools` | none | low to medium | bypass |
| `feedback.go` | `/api/feedback/*` | none | low | bypass |
| `crewai.go` | CrewAI routes | none | medium | bypass |
| `agentharness_gateway.go` | `/api/agentharnesses/...` | none | medium | bypass |
| `agentharness_session.go` | harness sessions | none | medium | bypass |
| `companion_secrets.go` | companion secrets | none | high (secrets) | bypass |
| `current_user.go` | `/api/user` | none | low (self) | acceptable |
| `health.go` | `/health`, `/version` | none (by design) | none | acceptable |
| A2A invoke | `/api/a2a/{ns}/{name}` | authn only (`A2AAuthenticator`), no `Authorizer` | high (direct agent run) | bypass |

Summary: about 8 of the ~22 handler areas gate authz today, all of them CRUD on
Agent, ModelConfig, ToolServer, and PromptTemplate. The remaining areas include
the most sensitive surfaces: sessions, memory, companion secrets, model-provider
config, and A2A invoke. Turning on CELAuthorizer without the coverage work
leaves those as bypass paths, which is why the "wired into every handler" line
has to become a real coverage matrix and a scope commitment.

Bottom line

The direction is right: CEL as the default behind the existing interface. Merge
it as provisional. Before it moves to accepted:

1. Replace the coverage claim with the matrix above, commit to closing the
   sensitive gaps, and adopt a deny-by-default authz middleware (hybrid with
   per-handler Check()) so the gap cannot come back.
2. Move M2M principals and policy combining out of "Open Questions" and into
   resolved design. For M2M, adopt verified workload identity (projected SA
   token or SPIFFE) bound into the principal, not the X-Agent-Name header.
3. Add the observability and threat-model sections.

References

Related PRs and issues:

- #1766: per-agent kagent.dev/allowed-groups annotation, GroupAuthorizer,
  agent-list filtering, and A2A request gating (stalled on inactivity).
- #1370: pluggable external authorizer (OPA-style webhook) behind the
  Authorizer interface (stalled).
- #1293: OIDC proxy authentication (EP-476), which added the trusted-proxy
  mode this EP builds on.
- #2071: STS and delegated identity, relevant to propagating caller claims on
  the agent path.
- #2028: network gating (HTTPRoute, NetworkPolicy, OpenShift Route), the
  out-of-scope edge controls.

Source locations (verified against main):

- NoopAuthorizer.Check returns nil: go/core/internal/httpserver/auth/authz.go.
- Authorizer interface, the Verb set (get/create/update/delete),
  Resource{Name, Type} with no Namespace field, and Principal.Claims:
  go/core/pkg/auth/auth.go (Verb at lines 9-16, Resource at 18-21,
  Principal at 32-36).
- Central Check helper and the HTTP-method-to-verb switch:
  go/core/internal/httpserver/handlers/helpers.go:56-80.
- Middleware chain, where an AuthzMiddleware would sit next to
  AuthnMiddleware: go/core/internal/httpserver/server.go:356-360.
- A2A registered as a PathPrefix handler with authentication only, no
  authorizer: go/core/internal/httpserver/server.go:347.
- cel-go already in the module graph (indirect today, promote to direct when
  implementing): go/go.mod.
- Agent status conditions, where AccessPolicyValid would be reported:
  go/api/v1alpha2/agent_types.go.

Copilot AI review requested due to automatic review settings June 23, 2026 10:57

github-actions bot added the documentation and enhancement-proposal labels Jun 23, 2026

Copilot started reviewing on behalf of davidkarlsen June 23, 2026 10:58

Copilot AI reviewed Jun 23, 2026

Comment thread design/EP-1270-Authorization.md Outdated

davidkarlsen force-pushed the feat/ep-authz branch from 430afb4 to c86e289 Compare June 23, 2026 13:27

davidkarlsen and others added 2 commits June 23, 2026 15:28

davidkarlsen force-pushed the feat/ep-authz branch from c86e289 to a5366a4 Compare June 23, 2026 13:29

davidkarlsen wants to merge 2 commits into kagent-dev:main from davidkarlsen:feat/ep-authz
